Lexicon Stratification for Translating Out-of-Vocabulary Words

نویسندگان

  • Yulia Tsvetkov
  • Chris Dyer
چکیده

A language lexicon can be divided into four main strata, depending on origin of words: core vocabulary words, fullyand partiallyassimilated foreign words, and unassimilated foreign words (or transliterations). This paper focuses on translation of fullyand partially-assimilated foreign words, called “borrowed words”. Borrowed words (or loanwords) are content words found in nearly all languages, occupying up to 70% of the vocabulary. We use models of lexical borrowing in machine translation as a pivoting mechanism to obtain translations of out-of-vocabulary loanwords in a lowresource language. Our framework obtains substantial improvements (up to 1.6 BLEU) over standard baselines.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A hybrid language model for open-vocabulary Thai LVCSR

This paper investigates the use of a hybrid language model for open-vocabulary Thai LVCSR. Thai text is written without word boundary markers and the definition of word unit is often ambiguous due to the presence of compound words. Hence, to build open-vocabulary LVCSR, a very large lexicon is required to also handle word unit ambiguity. Pseudomorpheme (PM), a syllable-like sub-word unit specif...

متن کامل

Generation of Adaptive Vocabulary Lexicon for Japanese LVCSR

One of the thorniest problems of large vocabulary continuous speech recognition systems is the large number of out-of-vocabulary (OOV) words. This is especially the case for the languages like Japanese, which has many inflections, compound words and loanwords. The OOV words vary with the application domains. It's not realistic to have a big general-purpose lexicon including any possible 00V wor...

متن کامل

Local Methods for On-Demand Out-of-Vocabulary Word Retrieval

Most of the Web-based methods for lexicon augmenting consist in capturing global semantic features of the targeted domain in order to collect relevant documents from the Web. We suggest that the local context of the out-of-vocabulary (OOV) words contains relevant information on the OOV words. With this information, we propose to use the Web to build locally-augmented lexicons which are used in ...

متن کامل

The Effect of Lexicon-based Debates on the Felicity of Lexical Equivalents in Translating Literary Texts by Iranian EFL Learners

This study was an attempt to investigate the effect of lexicon-based debates on the felicity of lexical equivalents in translating literary texts by Iranian EFL learners.  To fulfill the purpose of this study, 59 university students, majoring in English Translation, were randomly assigned to the experimental and control groups from a total of 73 students based on their performance on a mock TOE...

متن کامل

Capturing Out-of-Vocabulary Words in Arabic Text

The increasing flow of information between languages has led to a rise in the frequency of non-native or loan words, where terms of one language appear transliterated in another. Dealing with such out of vocabulary words is essential for successful cross-lingual information retrieval. For example, techniques such as stemming should not be applied indiscriminately to all words in a collection, a...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015